24        Bioinformatics

The GC content (%GC) is important for revealing sequencing problems caused by bias. There is a relationship between GC content and read coverage across a genome: some short-read sequencers may sequence regions with higher GC content more or less often than regions with lower GC content [11]. The average GC content of bacterial genomes varies from less than 15% to more than 75% [12], and the average genomic GC content of most eukaryotes lies somewhere within 40%–50% [13]. A very low or very high GC content may indicate a potential sequencing bias problem that we may need to fix.
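As a rough illustration of how %GC can be computed for a read, the following minimal Python sketch counts G and C bases. The function name `gc_content` and the decision to ignore ambiguous bases such as N are assumptions made for this example, not part of any particular tool.

```python
def gc_content(sequence: str) -> float:
    """Return %GC computed over the unambiguous bases (A, C, G, T) only.

    Assumption: ambiguous bases such as N are excluded from the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        # No countable bases (empty read or all N); report 0.0 by convention.
        return 0.0
    return 100.0 * gc / total


if __name__ == "__main__":
    # Hypothetical reads used only for demonstration.
    for read in ["ATGCGC", "AATTAT", "GGCCNN"]:
        print(read, round(gc_content(read), 1))
```

Averaging this value over all reads in a FASTQ file gives a figure that can be compared with the GC content expected for the organism.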

1.5.2  Per Base Sequence Quality

Per Base Sequence Quality graphs (Figure 1.14) show box plots of the quality score distribution at each base position across all reads in the forward and reverse FASTQ files. The position indexes are plotted on the x-axis against the Phred quality scores on the y-axis. The higher the quality score, the better the base call.
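To make the Phred scale concrete, the sketch below decodes a FASTQ quality line into numeric scores and converts a score into an error probability using the standard relation Q = -10 log10(P). It assumes the Phred+33 ASCII encoding; the function names are hypothetical.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into Phred scores.

    Assumption: the file uses the Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q: float) -> float:
    """Convert a Phred score Q into the base-call error probability P."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical quality line: 'I' decodes to 40, '5' to 20, '!' to 0.
    print(phred_scores("I5!"))
    print(error_probability(20))  # roughly 1 error in 100 base calls
```

A score of 20 therefore corresponds to about a 1% chance that the base call is wrong, and a score of 30 to about 0.1%.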

The upper limit of the x-axis depends on the read length: if all reads have the same length, it equals that length; otherwise, it is the number of bases in the longest read. Each box plot displays the five-number summary of the base quality scores at that position. The five-number summary includes the minimum quality score (the lower end of the lower whisker), first quartile (the lower end of the box), median (the red line), third quartile (the upper end of the box), and maximum quality score (the upper end of the upper whisker). The yellow box of a box plot represents the interquartile range (IQR), which covers the middle 50% of the quality scores of the bases at that position. The longer the box, the greater the spread of the quality scores. The blue line on the graph represents the mean of the quality scores.
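The per-position statistics described above can be sketched with the Python standard library. This is an illustration only: the function name `per_base_summary` is hypothetical, and the quartile method (`inclusive`) is an assumption, since plotting tools may compute quartiles differently.

```python
from statistics import mean, quantiles


def per_base_summary(quality_lists):
    """Summarise Phred scores position by position.

    quality_lists holds one list of scores per read; reads may differ in
    length, so the number of positions equals the length of the longest read.
    """
    longest = max(len(q) for q in quality_lists)
    summaries = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        column = [q[pos] for q in quality_lists if len(q) > pos]
        if len(column) == 1:
            q1 = med = q3 = column[0]
        else:
            # Assumption: "inclusive" quartiles; other definitions exist.
            q1, med, q3 = quantiles(column, n=4, method="inclusive")
        summaries.append({
            "position": pos + 1,
            "min": min(column),
            "q1": q1,
            "median": med,
            "q3": q3,
            "max": max(column),
            "iqr": q3 - q1,
            "mean": mean(column),
        })
    return summaries


if __name__ == "__main__":
    # Hypothetical scores for four reads; the last read is one base shorter.
    reads = [[40, 38, 30], [38, 36, 20], [36, 30, 10], [40, 34]]
    for row in per_base_summary(reads):
        print(row)
```

Each dictionary corresponds to one box plot: `min` and `max` give the whisker ends, `q1` and `q3` the box, `median` the line inside the box, and `mean` one point of the mean line.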

The background of the graph is divided into three regions. The quality scores in the green region (Q > 28) are very good, the scores in the orange region (20 < Q < 28) are reasonable, and the scores in the red region (Q < 20) are poor. In general, for reads produced by Illumina, the per base qualities degrade toward the end of the reads.
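The three background regions amount to a simple threshold rule, sketched below. The text leaves the boundary values 20 and 28 unassigned; placing them in the higher region is an assumption of this example, as is the function name.

```python
def quality_region(q: float) -> str:
    """Map a Phred score to the background region of the graph.

    Assumption: the boundary scores 28 and 20 are assigned to the green
    and orange regions, respectively.
    """
    if q >= 28:
        return "green"   # very good
    if q >= 20:
        return "orange"  # reasonable
    return "red"         # poor


if __name__ == "__main__":
    for score in (36, 25, 12):
        print(score, quality_region(score))
```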

If the lower quartile of the quality scores for any position is less than 20, an orange warning will be displayed next to “Per Base Sequence Quality” as shown in Figure 1.14. A red warning (failure) is displayed if the lower quartile for any position is less than 5.
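The warning and failure rules can be expressed as a small check over the per-position lower quartiles. The sketch below uses the thresholds stated in the text (20 and 5) as default parameters; the function name and the "pass", "warn", and "fail" labels are hypothetical.

```python
def per_base_quality_status(lower_quartiles, warn_below=20, fail_below=5):
    """Return "fail", "warn", or "pass" from per-position lower quartiles.

    The default thresholds are the ones given in the text; they are exposed
    as parameters because a given tool may use different cut-offs.
    """
    lowest = min(lower_quartiles)
    if lowest < fail_below:
        return "fail"   # red
    if lowest < warn_below:
        return "warn"   # orange
    return "pass"


if __name__ == "__main__":
    # Hypothetical lower quartiles for a read set whose tail degrades.
    print(per_base_quality_status([34, 33, 30, 26, 18]))
```

Because the rule looks at the worst position, a single poor position near the end of the reads is enough to trigger the warning.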

We can see that the average base quality drops toward the end of the reads and deteriorates sharply at the 36th base. To avoid using low-quality sequence data in the

FIGURE 1.13  The basic statistics of the paired-end FASTQ files.